From Marbles to Daxes

An Introduction to HBMs and their Application to Category Learning

Jan Luca Schnatz

Hierarchical Beta-Binomial Model

Motivating Example

Repeatedly draw from bags of black and white marbles with an unknown proportion of black marbles:

  1. What color would you predict for the next marble drawn from bag 8?
  2. How did you arrive at that prediction?

Intuition

  • One black marble alone gives little information about the color of future marbles
  • But having seen many mostly-black or mostly-white bags before makes that single black marble highly informative

\(\rightarrow\) High chance that the next marbles are also black!

Intuition

  • Hierarchy
    • Information is shared across bags at higher levels
    • Observations from previous bags shape strong priors
    • These priors influence predictions about new bags

Goal

We want to build a Bayesian model that reverse-engineers the mind's reasoning about color distributions across bags.

Formalizing the Problem

We have a set of bags of marbles indexed by \(i\), where \(y_i\) is the number of black marbles observed and \(n_i\) is the total number of marbles drawn.

Level 1 – Data

\(d_i: \big\{y_i, n_i \big\}\)

Level 2 – Bag-specific distribution

  • \(\theta_i\): probability of drawing a black marble from bag \(i\)
  • Different bags can have different probabilities \(\theta_i\)

\(y_i ~ \big| ~ n_i \sim \text{Binom}(n_i, \theta_i)\)

Level 3 – General knowledge about bags

The Beta distribution can be reparametrized in terms of

  • its expected value \(\frac{\alpha}{\alpha + \beta}\), the average probability of a black marble across bags
  • its precision \(\alpha + \beta\), capturing how tightly the probability mass concentrates around that mean (inversely related to the variance)

\(\theta_i \sim \text{Beta}(\alpha, \beta)\)

Level 4 – Hyperparameters

  • Priors on the parameters of the Beta distribution
  • The uniform prior on \(\frac{\alpha}{\alpha + \beta}\) implies that, before seeing any data, every average probability of drawing a black marble is equally likely
  • The exponential prior on \(\alpha + \beta\) implies that smaller values (weaker concentration) are more likely a priori

\(\frac{\alpha}{\alpha + \beta} \sim \text{Unif}(0, 1)\)

\(\alpha + \beta \sim \text{Exp}(1)\)
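The four levels can be sketched as a forward simulation, drawing hyperparameters, then bag-specific probabilities, then observed counts. A minimal sketch; the number of bags, draws per bag, and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
n_bags, n_draws = 10, 20

# Level 4 - hyperpriors: mean ~ Unif(0, 1), precision ~ Exp(1)
mean = rng.uniform(0.0, 1.0)
precision = rng.exponential(1.0)

# Recover the canonical Beta parameters from (mean, precision)
alpha = mean * precision
beta = (1.0 - mean) * precision

# Level 3 - bag-specific probabilities of drawing a black marble
theta = rng.beta(alpha, beta, size=n_bags)

# Levels 2 and 1 - observed counts of black marbles per bag
y = rng.binomial(n_draws, theta)
print(y)
```

Because the sampled precision is small under Exp(1), the simulated bags tend to be extreme (mostly black or mostly white), matching the intuition from the motivating example.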

Posterior Inference of HBM

Applying Bayes Formula to HBM

\[ \begin{gathered} \overbrace{P(\theta, \alpha, \beta ~ | ~ y)}^{\text{Posterior}} \propto \underbrace{P(\alpha, \beta)}_{\text{Hyperprior}} \overbrace{P(\theta ~ | ~ \alpha, \beta)}^{\text{Conditional Prior}} \underbrace{P(y ~ | ~ \theta, \alpha, \beta)}_{\text{Likelihood}} \end{gathered} \]

Posterior inference for \(\theta_i\) proceeds by integrating out \(\alpha\) and \(\beta\)

\[ \begin{align*} P(\theta_i ~ | ~ d_1, \dots, d_n) = \iint P(\theta_i ~ | ~ \alpha, \beta, d_i) P(\alpha, \beta ~ | ~ d_1, \dots, d_n) \,d\alpha \,d \beta \end{align*} \]
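This integral can be approximated on a grid over the reparametrized hyperparameters. The sketch below uses SciPy's beta-binomial distribution, which marginalizes each \(\theta_i\) analytically; the toy counts and grid ranges are assumptions for illustration:

```python
import numpy as np
from scipy.stats import betabinom

# Toy data: black-marble counts y out of n = 20 draws for eight bags
y = np.array([18, 19, 1, 0, 20, 2, 19, 1])
n = np.full_like(y, 20)

# Grid over the reparametrized hyperparameters
means = np.linspace(0.01, 0.99, 50)        # alpha / (alpha + beta)
precisions = np.linspace(0.1, 40.0, 60)    # alpha + beta
M, S = np.meshgrid(means, precisions)
A, B = M * S, (1 - M) * S

# Log hyperprior: Unif(0, 1) on the mean (constant) plus Exp(1) on the precision
log_prior = -S

# Log marginal likelihood of each bag: the beta-binomial pmf integrates
# theta_i out of Binom(n_i, theta_i) with a Beta(alpha, beta) prior
log_lik = sum(betabinom.logpmf(yi, ni, A, B) for yi, ni in zip(y, n))

# Normalized grid weights approximate P(alpha, beta | d_1, ..., d_n)
logw = log_prior + log_lik
logw -= logw.max()
w = np.exp(logw)
w /= w.sum()

# Posterior mean of theta_i for bag i = 2 (a mostly-white bag): a mixture
# of conjugate Beta posterior means, weighted over the hyperparameter grid
i = 2
post_mean = (w * (A + y[i]) / (A + B + n[i])).sum()
print(round(post_mean, 3))
```

Even though bag 2 contributes only 20 draws, the posterior mean is pulled toward the extremes learned from the other bags, which is exactly the sharing of information across levels described above.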

Applying the Model to the Marbles Example

Interim Summary

Key Takeaway

The marble example demonstrated that HBMs align nicely with our intuition about how structured data can be used to form strong overhypotheses.


Why does this matter?

This abstract knowledge is what enables rapid learning from sparse data and one-shot generalization.

Application of HBMs to Category Learning

Motivating Example

A mother points to an unfamiliar object lying on the counter and tells her child that this is a pen.

Question

By which features do children generalize the concept of a pen and recognize future instances of a pen as a pen?

  • In principle, the child could generalize the word to objects with the same material, same color, same texture, or simply objects lying on the counter
  • But empirically, children tend to generalize the new word to other objects that share the shape

Shape Bias

The expectation that members of a category tend to be similar in shape, which is learned by the age of 24 months (Smith et al., 2002).

Model Adaptation

Overview of Changes
                        Marble Example    Shapes Example
  Structuring Variable  Bag               Object Category
  Data                  Marble            Object Exemplar
  Features              Color             Shape, Color, Texture, Size, etc.
  Feature Values        Binary            Categorical

  • Level 1: Binary Observations \(\rightarrow\) Categorical Observations
  • Level 2: Binomial Distribution \(\rightarrow\) Multinomial Distribution
  • Level 3: Beta Distribution Prior \(\rightarrow\) Dirichlet Prior
  • Level 4: Hyperprior similar to before

Levels 2–4 are copied for each feature dimension (shape, color, texture, size).
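The adapted generative process can be sketched per feature dimension, mirroring the marble model with Dirichlet and categorical distributions in place of Beta and binomial. The number of feature values per dimension and the hierarchy sizes below are illustrative assumptions, not values from the original studies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical number of possible values per feature dimension
feature_dims = {"shape": 6, "color": 10, "texture": 10, "size": 2}
n_categories, n_exemplars = 4, 2

data = {}
for name, k in feature_dims.items():
    # Level 4 - hyperpriors mirroring the marble model: a mean vector drawn
    # uniformly over the simplex and an Exp(1) precision
    mean = rng.dirichlet(np.ones(k))
    precision = rng.exponential(1.0)
    alpha = mean * precision + 1e-3  # small floor for numerical stability

    # Level 3 - per-category distributions over this feature's values
    theta = rng.dirichlet(alpha, size=n_categories)

    # Levels 2/1 - categorical feature values of each category's exemplars
    data[name] = np.array([
        rng.choice(k, size=n_exemplars, p=theta[c]) for c in range(n_categories)
    ])
```

Each dimension gets its own independent copy of the hierarchy, so the model can learn, for example, a high precision for shape (categories are shape-consistent) alongside a low precision for color.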

Model Adaptation

  • The model infers that categories are consistent in shape (low variance) but variable in color (high variance).
  • This learned structure creates a strong prior expectation that any new category will also be homogeneous in shape.
  • Consequently, the model enables rapid generalization of novel labels based on shape similarity, effectively ignoring differences in color.

Application to Noun Generalization Task

(Glassen & Nitsch, 2016; Griffiths et al., 2024; Kemp et al., 2007)

Table 1: Training Data

  Category  1  1  2  2  3  3  4  4
  Shape     1  1  2  2  3  3  4  4
  Texture   1  2  3  4  5  6  7  8
  Color     1  2  3  4  5  6  7  8
  Size      1  2  1  2  1  2  1  2

  • Two exemplars per category (columns)
  • Different feature dimensions as rows (shape, texture, color, size)
  • Pairs of objects belonging to the same category share the same shape!
Table 2: Testing Data

            'Dax'  Object 1  Object 2  Object 3
  Category    5       ?         ?         ?
  Shape       5       5         6         6
  Texture     9      10         9        10
  Color       9      10        10         9
  Size        1       1         1         1

After training, children (and the model) encounter a new object with a novel noun “dax”.

Task: Which of the three candidates with unknown category labels is most likely to be a dax?

Data based on Smith et al. (2002)
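The logic of the task can be illustrated with a small sketch. This is not the full hierarchical model, but a consistency-weighted heuristic that captures what the model learns from Table 1 (shape is the reliable dimension within categories) and applies it to the candidates in Table 2:

```python
# Training data from Table 1: feature values of the eight exemplars,
# ordered so that columns 2i and 2i+1 belong to the same category
train = {
    "shape":   [1, 1, 2, 2, 3, 3, 4, 4],
    "texture": [1, 2, 3, 4, 5, 6, 7, 8],
    "color":   [1, 2, 3, 4, 5, 6, 7, 8],
    "size":    [1, 2, 1, 2, 1, 2, 1, 2],
}

# Within-category consistency of a dimension: fraction of categories
# whose two exemplars share the same feature value
def consistency(values):
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return sum(a == b for a, b in pairs) / len(pairs)

weights = {dim: consistency(v) for dim, v in train.items()}
# shape is perfectly consistent; texture, color, and size are not

# Test data from Table 2: score candidates by consistency-weighted matches
dax = {"shape": 5, "texture": 9, "color": 9, "size": 1}
objects = [
    {"shape": 5, "texture": 10, "color": 10, "size": 1},  # Object 1
    {"shape": 6, "texture": 9,  "color": 10, "size": 1},  # Object 2
    {"shape": 6, "texture": 10, "color": 9,  "size": 1},  # Object 3
]
scores = [sum(weights[d] * (o[d] == dax[d]) for d in dax) for o in objects]
print(scores.index(max(scores)) + 1)  # -> 1 (Object 1, the shape match)
```

In the full HBM the same effect emerges from the learned per-dimension precisions rather than a hand-coded score, but the qualitative prediction is identical: the shape match wins.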

Results of Noun Generalization Task

  • 19-month-olds who received the structured training chose the shape match
  • Untrained 19-month-olds chose randomly
  • The hierarchical Bayesian model shows the same preference pattern as the trained children

Summary


References

Glassen, T., & Nitsch, V. (2016). Hierarchical Bayesian models of cognitive development. Biological Cybernetics, 110, 217–227. https://doi.org/10.1007/s00422-016-0686-6
Griffiths, T. L., Chater, N., & Tenenbaum, J. (2024). Bayesian Models of Cognition: Reverse Engineering the Mind. MIT Press. https://mitpress.mit.edu/9780262049412/bayesian-models-of-cognition/
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307–321. https://doi.org/10.1111/j.1467-7687.2007.00585.x
Smith, L. B., Jones, S. S., Landau, B., Gershkoff-Stowe, L., & Samuelson, L. (2002). Object name learning provides on-the-job training for attention. Psychological Science, 13(1), 13–19. https://doi.org/10.1111/1467-9280.00403